9 September 2015

Basic biology

Sodium bisulfite treatment of DNA

Assays

Bisulfite-based assays

This talk focuses on contemporary high-throughput assays based on sodium bisulfite treatment of DNA

Microarrays

  • Illumina 27k
  • Illumina 450k
  • Human only (mostly)

Sequencing

  • Whole-genome bisulfite-sequencing (WGBS/BS-seq/methylC-seq)
  • Targeted methods
    • (Enhanced/Extended) Reduced representation bisulfite-sequencing (eRRBS/RRBS)
    • Capture + bisulfite-sequencing (Roche SeqCap Epi system)
  • Any organism (mostly)

Non-bisulfite assays

Based on an enrichment/pulldown of methylated DNA and/or restriction enzymes

  • Methylated DNA immunoprecipitation + microarray/sequencing (MeDIP-microarray/MeDIP-seq/mDIP-seq)
  • Methylation-sensitive restriction enzyme + sequencing (MRE-seq)
  • Methyl binding domain protein-enrichment + sequencing (MBD-seq)

The analysis pipeline

Most assay-specific:

  • Getting data into R
  • Pre-processing
  • Analysis

Somewhere in between:

  • Batch effects

Less assay-specific:

  • Visualisation
  • Data integration

How can Bioconductor help?

Bioconductor 3.1 packages (based on DNAMethylation BiocViews):

  • 42 software packages
  • 6 annotation packages

Disclaimer

I can't tell you everything

  • 25 minutes
  • I don't know everything!

Will tell you:

  • What I find useful as a fairly experienced user and developer
  • What I am most familiar with
  • Where to find out more

Microarrays

Description of assays

What these measure

  • Red + green fluorescence intensities reflecting methylated and unmethylated signal

Illumina 27k

  • Infinium HumanMethylation27 BeadChip
  • ~28,000 cytosines
  • Mostly in promoters of ~15,000 genes
  • Infinium I probes
  • Deprecated? Even so, much of the TCGA data was generated on this platform.

Illumina 450k

  • Infinium HumanMethylation450 BeadChip
  • ~486,000 cytosines
  • Promoters, gene bodies, 3' UTR, intergenic
  • 135k Infinium I and 350k Infinium II probes
  • An overview of 450k technology

Key Bioconductor packages

Data ingest

File formats

  • .idat files
  • Files returned by Illumina's BeadStudio

minfi

  • read.450k(), read.450k.exp(), read.450k.sheet()
  • readTCGA()
  • readGEORawFile()

Pre-processing

Quality control

Pre-processing

Standard microarray issues

  • Failed probes
  • Cross-reactive probes
  • Background correction
  • Colour (dye) bias adjustment
  • Normalisation

450k-specific issues

  • Type I and II probes are very different
  • CpG-SNPs

Downstream analyses

\(\beta\)-values vs. \(\mathcal{M}\)-values

\(\beta = \frac{M}{M + U + 100} \in [0, 1]\)

\(\mathcal{M} = \log_2(\frac{M + 1}{U + 1}) \in (-\infty, \infty)\)

"We recommend using the \(\mathcal{M}\)-value method for conducting differential methylation analysis and including the \(\beta\)-value statistics when reporting the results to investigators." (Du, P. et al. BMC Bioinformatics 11, 587 (2010))
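A minimal sketch of the two scales, assuming the offsets shown above (100 for \(\beta\), 1 for \(\mathcal{M}\)) and the base-2 logarithm used by Du et al. The function names and probe intensities are hypothetical and for illustration only; this is not how minfi or any other package computes these values internally.

```python
import math


def beta_value(meth, unmeth, offset=100.0):
    """Beta-value: methylated share of total intensity, stabilised by an offset."""
    return meth / (meth + unmeth + offset)


def m_value(meth, unmeth, alpha=1.0):
    """M-value: log2 ratio of methylated to unmethylated intensity."""
    return math.log2((meth + alpha) / (unmeth + alpha))


def beta_to_m(beta):
    """Base-2 logit linking the two scales; exact only when both offsets are zero."""
    return math.log2(beta / (1.0 - beta))


# Hypothetical probe intensities: (methylated, unmethylated)
probes = {
    "mostly_methylated": (9000.0, 1000.0),
    "half_methylated": (5000.0, 5000.0),
    "mostly_unmethylated": (1000.0, 9000.0),
}

for name, (meth, unmeth) in probes.items():
    print(f"{name}: beta={beta_value(meth, unmeth):.3f}, "
          f"M={m_value(meth, unmeth):.3f}")
```

The \(\beta\) scale is bounded and easy to read as a proportion, while the \(\mathcal{M}\) scale is unbounded and symmetric around zero, which is why the quote above separates the scale used for testing from the scale used for reporting.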

Differential methylation

Differentially methylated probes (DMPs)

  • For a given probe, are the group-average level(s) of methylation different?
  • limma is the workhorse, e.g., minfi::dmpFinder()
  • OPINION: You want a pretty good reason not to use a limma-based approach
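To make the per-probe question concrete, here is a rough sketch that computes an ordinary Welch t-statistic on M-values for each probe and ranks probes by its magnitude. This is deliberately not limma: limma moderates the per-probe variances by borrowing information across probes, which this sketch does not attempt. The probe names and values are made up.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t-statistic for a difference in group means (unequal variances)."""
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Hypothetical M-values: probe -> (group 1 samples, group 2 samples)
m_values = {
    "probe_1": ([2.1, 1.9, 2.3, 2.0], [-1.0, -1.2, -0.8, -1.1]),
    "probe_2": ([0.1, -0.2, 0.3, 0.0], [0.2, -0.1, 0.0, 0.1]),
}

# One statistic per probe, then rank probes by absolute size of the statistic
t_stats = {probe: welch_t(a, b) for probe, (a, b) in m_values.items()}
ranked = sorted(t_stats, key=lambda p: abs(t_stats[p]), reverse=True)

for probe in ranked:
    print(f"{probe}: t = {t_stats[probe]:.2f}")
```

With only a handful of samples per group, per-probe variance estimates like these are noisy, which is the usual argument for a moderated approach.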

Differentially methylated regions (DMRs)

  • Identify regions with different group-average level(s) of methylation
  • OPINION: DMR finding/testing is somewhat ad hoc, but getting better
  • E.g., see minfi::blockFinder() with the bumphunter::bumphunter() backend

Differential variability

  • For a given probe, are the within-group variances of methylation levels different?
  • missMethyl::varFit(), missMethyl::topVar() based on limma
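A toy illustration of the differential-variability question for a single probe: score each sample by its absolute deviation from its own group mean (a Levene-type score) and compare the average score between groups. This is a sketch of the general idea only; it makes no claim about how varFit() is implemented, and the function names and data are hypothetical.

```python
from statistics import mean


def abs_deviations(values):
    """Absolute deviation of each sample from its own group mean."""
    centre = mean(values)
    return [abs(v - centre) for v in values]


def spread_difference(group_a, group_b):
    """Difference in mean absolute deviation between two groups for one probe."""
    return mean(abs_deviations(group_a)) - mean(abs_deviations(group_b))


# Hypothetical M-values for one probe: same group mean, different spread
tight = [1.0, 1.1, 0.9, 1.0]
loose = [0.0, 2.0, -0.5, 2.5]

print(f"difference in spread: {spread_difference(loose, tight):.3f}")
```

A test of group means would find nothing for this probe, since both groups average 1.0; the difference is entirely in the spread.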

Sequencing

Description of assays

What these measure

Single-base resolution data of cytosine methylation

Whole-genome bisulfite-sequencing

  • Gold standard
  • Genome-wide assay (~25,000,000 CpGs in human)
  • Expensive
  • 2-30x sequencing coverage

Targeted bisulfite-sequencing

  • Reduced representation bisulfite-sequencing and SeqCap
  • Cheaper
  • 20-60x sequencing coverage

Key Bioconductor packages

Still being figured out …

OPINION

  • bsseq for whole-genome bisulfite-sequencing
  • BiSeq for reduced representation bisulfite-sequencing
  • RnBeads for comprehensive WGBS/RRBS pipeline

There are also non-R/Bioconductor options

Non-R preliminaries

Pipeline (including most pre-processing)

Input to Bioconductor

  • Some aligner-specific file format with the following data per sample:

    chr   pos  M   U
    chr7  666  13  2
    chr7  685  12  0
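As a rough sketch of what ingesting such a file involves, the following parses whitespace-delimited rows of (chr, pos, M, U) and derives per-locus coverage and the raw methylation level M / (M + U). The column layout is assumed from the example rows above; real aligner output differs by tool, and the function name is hypothetical.

```python
import io

# The two example rows shown above, as whitespace-delimited text
example = """chr  pos M  U
chr7 666 13 2
chr7 685 12 0
"""


def read_counts(handle):
    """Parse rows of (chr, pos, M, U) into dicts with coverage and methylation level."""
    header = handle.readline().split()
    records = []
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        row = dict(zip(header, line.split()))
        meth, unmeth = int(row["M"]), int(row["U"])
        coverage = meth + unmeth
        records.append({
            "chr": row["chr"],
            "pos": int(row["pos"]),
            "M": meth,
            "U": unmeth,
            "coverage": coverage,
            # Raw methylation level; undefined when no reads cover the locus
            "beta": meth / coverage if coverage > 0 else None,
        })
    return records


records = read_counts(io.StringIO(example))
for record in records:
    print(record)
```

Keeping M and U as counts, rather than storing only the ratio, preserves the coverage information that downstream methods need to weight each locus.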

Pre-processing

  • Most done prior to reading data into R
  • More probably ought to be done

Some issues

  • CpGs overlapping genetic variants
  • Copy number variation
  • Normalisation (nothing much yet available)

Downstream analyses with bisulfite-sequencing data

What you can do with a microarray plus more

Smoothing \(\beta\)-values

Cartoon (see e.g., bsseq::BSmooth() for a proper implementation)
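In the same cartoon spirit, a minimal sketch of smoothing: each locus gets a kernel-weighted, coverage-weighted average of the methylation levels around it. BSmooth itself fits a local-likelihood model, so this is a much cruder stand-in. The tricube kernel, the 1000 bp bandwidth, and the data are illustrative assumptions.

```python
def tricube(u):
    """Tricube kernel weight for a scaled distance u; zero at or beyond |u| = 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def smooth_beta(positions, meth, cov, bandwidth=1000):
    """Kernel- and coverage-weighted average methylation around each locus."""
    smoothed = []
    for centre in positions:
        num = den = 0.0
        for pos, m, n in zip(positions, meth, cov):
            w = tricube((pos - centre) / bandwidth)
            num += w * m  # summing reads, so high-coverage loci count for more
            den += w * n
        smoothed.append(num / den if den > 0 else None)
    return smoothed


# Hypothetical CpGs on one chromosome: position, methylated reads, total reads
positions = [100, 150, 400, 900, 5000]
meth = [9, 1, 8, 7, 0]
cov = [10, 2, 10, 8, 6]

raw = [m / n for m, n in zip(meth, cov)]
smoothed = smooth_beta(positions, meth, cov)
for pos, r, s in zip(positions, raw, smoothed):
    print(f"{pos}: raw={r:.2f}, smoothed={s:.2f}")
```

The low-coverage locus at position 150 is pulled towards its well-covered neighbours, while the isolated locus at position 5000 has no neighbours within the bandwidth and keeps its raw value.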

Differentially methylated regions

Cartoon (see e.g., bsseq::BSmooth.tstat() and bsseq::dmrFinder() for a proper implementation)
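A simplified sketch of the region-finding step: given a per-CpG statistic (for example, a t-statistic), keep loci whose absolute value reaches a cutoff and merge runs of nearby loci with the same sign into candidate regions. The cutoff, maximum gap, and minimum number of loci are illustrative values, and the function is a hypothetical stand-in rather than the bsseq implementation.

```python
def find_regions(positions, stats, cutoff=2.0, max_gap=300, min_loci=3):
    """Group consecutive loci whose statistic passes the cutoff in the same direction."""
    regions, current = [], []

    def flush():
        # Keep the current run only if it contains enough loci, then reset it
        if len(current) >= min_loci:
            regions.append({
                "start": current[0][0],
                "end": current[-1][0],
                "n": len(current),
                "direction": "hyper" if current[0][1] > 0 else "hypo",
            })
        current.clear()

    for pos, stat in zip(positions, stats):
        if abs(stat) < cutoff:
            flush()  # a locus below the cutoff ends any open run
            continue
        too_far = bool(current) and pos - current[-1][0] > max_gap
        sign_flip = bool(current) and (stat > 0) != (current[-1][1] > 0)
        if too_far or sign_flip:
            flush()
        current.append((pos, stat))
    flush()
    return regions


# Hypothetical sorted positions on one chromosome with per-CpG statistics
positions = [100, 180, 260, 340, 2000, 2080, 2150, 2230, 9000]
stats = [3.1, 2.8, 4.0, 0.5, -2.5, -3.3, -2.9, -2.2, 5.0]

regions = find_regions(positions, stats)
for region in regions:
    print(region)
```

The isolated locus at position 9000 passes the cutoff but is dropped because a single locus does not make a region. Choices like these thresholds are part of what makes DMR finding feel ad hoc.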

Other downstream analyses

General issues

Batch effects and unwanted variation

Yes, they affect DNA methylation data!

Methods better developed for microarray methylation data

  • sva: General methods for removing batch effects and other unwanted variation in high-throughput experiments.
  • missMethyl: Includes RUVadj() and RUVfit() specific to 450k data.

Cell-type heterogeneity

Visualisation: coMET

Visualisation: epivizr

Data integration

Summary

  • Bioconductor contains many packages for analysing DNA methylation data

Microarray

  • Analysis pipeline is well-established
  • OPINION: Start with minfi

Sequencing

  • More work done outside of R/Bioconductor
  • OPINION: Move to Bioconductor for exploratory data analysis, inference, visualisation, and integration.

Technical biases are rife

  • Careful experimental design and a skeptical mind are key

Links

Notes

Warning

WARNING: The remaining 'slides' are notes and are not part of the presentation

Manual package curation (BioC 3.1)

Microarrays

  • BEclear (batch effects)
  • ChAMP
  • charm
  • COHCAP (+ BS-seq)
  • conumee
  • CopyNumber450k
  • DMRcate
  • DMRforPairs
  • ENmix
  • lumi
  • MethylAid
  • MethylMix
  • methylumi
  • minfi
  • missMethyl
  • shinyMethyl
  • skewr
  • wateRmelon

Sequencing

  • BiSeq
  • bsseq
  • DMRcaller
  • DSS
  • M3D
  • methylPipe
  • MethylSeekR
  • MPFE

Misc.

  • bumphunter (general)
  • coMET (visualisation of EWAS and 'co-methylation LD maps')
  • MassArray (Sequenom)
  • MEDIPS (MeDIP-seq)
  • MEDME (MeDIP-microarray)
  • methVisual (clone BS-seq)
  • methyAnalysis (mostly arrays with some seq, same dev as lumi)
  • methylMnM (MeDIP-seq and MRE-seq)
  • Repitools (MeDIP-seq)
  • RnBeads (slick pipeline run with "modules", builds on other packages)